With some free time on my hands between Coursera courses and classes not starting for another couple of weeks, I wanted to try some of the new Azure Data Lake services and build a Big Data analytics proof of concept based on a large public dataset. So I decided to create this series of posts to document the experience and see what can be built with them.
To also play with some shiny new tools, I ran all of these steps using the new Ubuntu Bash on Windows 10 and the Azure CLI. Now that Bash is available on Windows, I think the Azure CLI is the best tool to use, as scripts created with it can run on both Windows and Linux without any modifications. (In other multi-platform and OSS news, Microsoft also recently announced the availability of PowerShell on Linux, but I still think that using Bash makes more sense than PowerShell.)
What is Azure Data Lake?
Azure Data Lake is a collection of services to help you create your own data lake and run analytics on its data. The two services are called “Azure Data Lake Store” and “Azure Data Lake Analytics”. Why would you use them as opposed to building your own on-premises data lake? Cost is the first reason that comes to mind, as with any cloud-based offering. The smart idea behind these two services is that you can scale storage independently of compute, whereas with an on-prem Hadoop cluster you would be scaling both hand-in-hand. With Azure Data Lake you can store as much data as you need and only use the analytics engine when required.
To use these services you need an Azure subscription and request access to the preview version of the Azure Data Lake Store and Azure Data Lake Analytics services. The turnaround time to get approved is pretty quick, around an hour or so.
What is the difference between Azure Data Lake Store and Blob Storage?
Azure Data Lake Store has some advantages over Blob Storage: it overcomes Blob Storage’s size limitations and can theoretically scale without limit. You can run Data Lake Analytics jobs against data stored in either Blob Storage or Data Lake Store, but apparently you get much better performance with Data Lake Store.
Cost is another differentiator: Blob Storage is cheaper than Data Lake Store.
Summary: use Blob Storage for large files that you are going to keep for a long time, and copy them to the Data Lake Store only when you need to run analytics on them.
Data set: Reddit Public comments
I found a very interesting site called Academic Torrents, which lists large public datasets for academic use. The reddit dataset is about 160 GB compressed in bz2 files and is composed of about 1.7 billion JSON comment objects from reddit.com between October 2007 and May 2015. The great thing about it is that it is split into monthly chunks (one file per month), so you can download a single month of data and start working right away.
To download the contents you can use your torrent client of choice. (I only downloaded the files for 2007 to run this proof of concept.)
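Each line in these files is one comment serialized as a JSON object, so you can peek at the data without fully expanding an archive. A quick sketch, assuming the October 2007 chunk has been downloaded into the current directory:

```shell
# Stream-decompress the archive and pretty-print the first comment
# object; the uncompressed file never needs to hit the disk.
bzcat RC_2007-10.bz2 | head -n 1 | python3 -m json.tool
```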
Setting up the Azure Data Lake Store
To run all these steps you first need to have the Azure CLI available in the Ubuntu Bash.
1. The first step is to install Node.js, which the Azure CLI runs on. You can skip this if you already have Node.js installed, or you are running this somewhere with Node.js available.
curl -sL https://deb.nodesource.com/setup_4.x | sudo -E bash -
sudo apt-get install -y nodejs
2. Then you need to download and install the Azure CLI
wget aka.ms/linux-azure-cli -O azure-cli.tar.gz
gzip -d ./azure-cli.tar.gz
sudo npm install -g ./azure-cli.tar
3. Run some validation to verify the CLI installed correctly
azure help
azure --version
4. Now you need to connect the CLI to your subscription and set it to Resource Manager mode
azure login
azure config mode arm
5. Create a resource group if you don’t have one, or if you want a new one just for this exercise. In this case it is named dataRG
azure group create -n "dataRG" -l "Canada East"
6. Next, you need to register the Data Lake Store resource provider with your subscription (the Data Lake Analytics provider gets registered later).
azure provider register Microsoft.DataLakeStore
7. Create an Azure Data Lake Store account. Keep in mind the service is only available in the East US 2 region so far. The account name in this case is redditdata
azure datalake store account create redditdata eastus2 dataRG
8. Create a folder. Here, I’m creating a folder “2007” to store the files from that year.
azure datalake store filesystem create redditdata 2007 --folder
9. As the downloaded files are bz2-compressed, expand them first. I only expanded one of them, as I may want to try using Azure Data Factory to expand the rest.
bzip2 -d ./RC_2007-10.bz2
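If you later want to expand every monthly file, a loop like this would do it (the -k flag keeps the compressed originals around so they can still be uploaded as-is):

```shell
# Expand every 2007 monthly archive, keeping the .bz2 originals (-k).
for f in ./RC_2007-*.bz2; do
    bzip2 -dk "$f"
done
```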
10. Upload files to the Data Lake Store folder. In this case I uploaded the expanded file from the previous step and one of the compressed files.
azure datalake store filesystem import redditdata ./RC_2007-10 "/2007/RC_2007-10.json"
azure datalake store filesystem import redditdata ./RC_2007-11.bz2 "/2007/RC_2007-11.bz2"
After all these steps, you should have both files (compressed bzip2 and uncompressed json) uploaded to the Data Lake store.
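With many monthly files, the imports can be scripted too. Here is a dry-run sketch that only prints the import commands (remove the leading echo to actually run them against the redditdata store):

```shell
# Print the import command for each local 2007 file; drop the
# leading "echo" to actually upload to the redditdata store.
for f in ./RC_2007-*; do
    echo azure datalake store filesystem import redditdata "$f" "/2007/$(basename "$f")"
done
```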
Setting up Azure Data Lake Analytics
1. Register the Data Lake Analytics provider for your subscription. This is similar to what we did in step 6, but now for Data Lake Analytics. If the provider is already registered, you may not need this step at all.
azure provider register Microsoft.DataLakeAnalytics
2. Create an account. In this case I’m calling it “redditanalytics”, the region is still East US 2, and I’m using the dataRG resource group and the redditdata Data Lake Store, both of them created in the previous steps.
azure datalake analytics account create "redditanalytics" eastus2 dataRG redditdata
Summary
With all these steps we have just set the stage for diving deep into analytics on the data. That will come in a future post, as I’m still figuring out how to do it. But so far we have proved that the Azure CLI works quite well in Bash on Windows, and that you can manage most (if not all) of your subscription through it. Azure Data Lake Store seems like a service created to work exclusively paired with Data Lake Analytics, so I have yet to see whether the value delivered justifies using it.